Complete Developer Guide · 2025

Run AI Locally.
Own Your Intelligence.

A complete, beginner-friendly guide to Ollama: what it is, how to install it, and how to use it with harness tools like Codex CLI. No cloud, no API keys, no surveillance.

Open Source Local LLMs CPU & GPU No Cloud Needed
bash — ~/Desktop/ai-local
user@machine $ ollama run gemma3
> pulling model... ████████████ 100%
> model loaded. 0 cloud calls made.
user@machine $ ollama list
NAME           SIZE     MODIFIED
gemma3:latest   5.0 GB   2 hours ago
user@machine

Your laptop as an AI server

Ollama is an open-source tool that lets you run large language models (LLMs) directly on your own computer — no internet required, no API keys, no subscription. Think of it as a download manager + inference engine for AI models, wrapped in a simple command-line interface.

When you run a model with Ollama, your CPU (and GPU if available) do all the heavy lifting. Your prompts never leave your machine. This is called on-device inference.

Ollama also exposes a local REST API on port 11434, meaning any tool that knows how to talk to an HTTP endpoint can use it — editors, scripts, harness tools like Codex CLI, and more.

📡 Key Insight

Ollama is not an AI model itself. It's the runtime — the engine that loads, manages, and serves open-weight models like Gemma, Llama, Mistral, Phi, and others.

💻
Your App
Terminal, editor, script, browser
⚙️
Ollama
REST API :11434 model manager
🧠
LLM Model
Gemma, Llama, Mistral, Phi...
🖥️
Your CPU/GPU
100% local inference

How we got here

The ability to run LLMs locally didn't happen overnight. It's the result of years of research breakthroughs, open-source activism, and clever engineering.

2017
Transformers Paper — "Attention Is All You Need"
Google researchers published the transformer architecture that would become the backbone of every modern LLM. GPT, BERT, Llama — they all descend from this paper.
2020–2022
GPT-3 & The Closed AI Era
OpenAI released GPT-3 — powerful, but locked behind an API. Running it locally was impossible. The community began pushing for open alternatives.
Feb 2023
Meta releases LLaMA — the turning point
Meta released LLaMA to researchers under a restricted license, and within weeks the weights leaked publicly. For the first time, anyone could run a powerful language model on consumer hardware. llama.cpp, a pure C++ inference engine that ran on laptops, followed within days of the leak.
Mid 2023
Quantization unlocks consumer hardware
GGML, and later GGUF, quantization (4-bit, 5-bit) compressed models dramatically. A 7B-parameter model that needed ~14 GB of RAM now ran in under 5 GB. Regular laptops could run capable AI.
July 2023
🦙 Ollama launches (v0.1)
Built on top of llama.cpp, Ollama wrapped local model execution in a beautiful, Docker-inspired interface with simple commands (ollama run, ollama pull) and a REST API. It dramatically lowered the barrier to entry.
2024
Ecosystem explodes — Gemma, Mistral, Phi, Qwen
Google (Gemma), Microsoft (Phi), Mistral AI, and Alibaba (Qwen) all released competitive open-weight models. Ollama's model library grew to 100+ models. GPU acceleration support expanded for NVIDIA, AMD, and Apple Silicon.
2025
Harness tools proliferate — Codex CLI, Claude Code, Continue
AI coding assistants began supporting Ollama as a backend. You could now run a coding agent entirely offline, using your own hardware and models with zero cloud dependency.

Breaking it down

Complex technology should be explainable at every level. Here's Ollama explained three ways:

🧒 Explain Like I'm 8 (ELI8)
🧸

It's like downloading a smart toy to your room

You know how Siri and Alexa live on the internet and need Wi-Fi to answer you? Ollama is like downloading a really smart robot brain onto your computer, so it lives in your bedroom.

Once it's there, you can talk to it and ask it questions — even if your Wi-Fi is off! It doesn't tell anyone what you said, because it never goes to the internet. It's your private robot helper.

Ollama is the tool that helps put those robot brains on your computer. The brains are called "models" — they're like different toys you can download, each one good at different things.

👦 Explain Like I'm 10 (ELI10)
🔬

Running AI like a local game server

You know how in Minecraft you can play on a server with friends, but you can also start your own local server on your computer? Ollama is like setting up your own private AI server.

Big AI tools like ChatGPT run on huge computers in data centers — you're basically borrowing their power. With Ollama, you download the AI "brain" (called a model) to your own PC, and your computer does all the thinking.

The cool parts: it works offline, no one can see your chats, and you can try different models like Gemma or Llama — kind of like switching between different game characters, each with different abilities.

🎓 Developer Definition

Ollama is a locally run model inference server built on llama.cpp that provides a Docker-like CLI for pulling, running, and managing open-weight language models. It exposes an OpenAI-compatible REST API at localhost:11434, making it a drop-in replacement for cloud APIs in development workflows.

Step-by-step installation

General recommended specs for running models locally

🖥️
CPU
8+ cores
x86-64 or ARM (Apple Silicon)
🧠
RAM
16 GB+
More = bigger models
💾
Disk
50 GB+
Models range 2GB – 30GB+
🎮
GPU (Optional)
4 GB+ VRAM
NVIDIA/AMD/Intel Iris/Apple GPU
Hardware Tier | RAM | Models You Can Run | Speed | Status
Budget Laptop | 8 GB | gemma2:2b, phi3:mini, tinyllama | Slow (2–5 tok/s) | Works with patience
Mid-range Laptop | 16 GB | gemma3:4b, llama3.2:3b, mistral:7b | OK (5–15 tok/s) | Good daily driver
Gaming PC / M-series Mac | 32 GB | llama3.1:8b, qwen2.5:14b, gemma3:12b | Fast (15–50 tok/s) | Excellent
Workstation / Mac Studio | 64 GB+ | llama3.1:70b, qwen2.5:72b, deepseek | Very fast | Production grade
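
Where do those RAM numbers come from? A rough rule of thumb (an approximation, not an official formula) is that a 4-bit quantized model needs about half a byte of memory per parameter, plus a gigabyte or two of overhead for the context cache:

# Back-of-the-envelope RAM estimate for a 4-bit (Q4) quantized model
#   weights ≈ parameters × 0.5 bytes
#   7B model  → ~3.5 GB of weights + 1–2 GB overhead ≈ 5 GB of RAM
#   70B model → ~35 GB of weights + overhead → needs a 48–64 GB machine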

Install Ollama

Download and install Ollama for your platform.

# macOS / Linux — one-liner
curl -fsSL https://ollama.com/install.sh | sh

# Windows — Download installer from:
https://ollama.com/download/windows

# Verify installation
ollama --version

After install, Ollama runs as a background service and listens on localhost:11434.
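
You can verify the service is up before downloading anything by hitting the local port:

# The root endpoint replies with "Ollama is running"
curl http://localhost:11434

# List locally installed models via the API (empty until you pull one)
curl http://localhost:11434/api/tags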

Pull your first model

Choose a model from the Ollama library. For beginners, Gemma3 or Llama3.2 are excellent starting points.

# Pull a model (downloads to ~/.ollama/models)
ollama pull gemma3

# Or pull a specific size variant
ollama pull gemma3:4b
ollama pull llama3.2:3b
ollama pull mistral:7b

# List all downloaded models
ollama list

Run the model

# Interactive chat mode
ollama run gemma3

# Single prompt mode
ollama run gemma3 "Explain recursion in simple terms"

# Check what's running
ollama ps

# Stop a loaded model
ollama stop gemma3
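
ollama run also accepts piped standard input alongside a prompt, which makes it handy in shell pipelines. A quick sketch (the file names here are just examples):

# Ask the model to review a file's contents
cat utils.py | ollama run gemma3 "Review this code and point out bugs"

# Or summarize command output
git diff | ollama run gemma3 "Write a concise commit message for this diff"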

Use the API directly

Ollama exposes an OpenAI-compatible REST API. Any tool that supports OpenAI can point to Ollama instead.

# Basic API call with curl
curl http://localhost:11434/api/generate \
  -d '{
    "model": "gemma3",
    "prompt": "What is Ollama?",
    "stream": false
  }'

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
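
Because the /v1 endpoint mimics OpenAI's API, many OpenAI SDKs and tools can be redirected to Ollama simply by overriding their base URL and API key. The official OpenAI Python SDK, for example, reads these environment variables (support varies by tool, and Ollama ignores the key, but most clients require it to be non-empty):

# Point OpenAI-compatible tooling at the local Ollama server
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"   # placeholder value; never checked by Ollama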

Advanced: Create a custom Modelfile

A Modelfile is like a Dockerfile for AI models — define a system prompt, temperature, and parameters.

# Create a file called "Modelfile"
FROM gemma3

# Set a system prompt
SYSTEM """
You are a senior software engineer who gives
concise, accurate code reviews. Use markdown.
"""

# Tune parameters
PARAMETER temperature 0.3
PARAMETER num_ctx 8192

# Build and run your custom model
ollama create myreviewer -f Modelfile
ollama run myreviewer
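
To confirm the custom model actually picked up your system prompt and parameters, inspect it afterwards:

# Show the Modelfile and parameters baked into the custom model
ollama show myreviewer --modelfile
ollama show myreviewer --parameters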

Environment variables & configuration

# Change model storage location (default: ~/.ollama)
export OLLAMA_MODELS=/path/to/models

# Allow external access (LAN / other machines)
export OLLAMA_HOST=0.0.0.0:11434

# Note: the number of GPU-offloaded layers is a per-model option (num_gpu),
# not an environment variable. Set it with "PARAMETER num_gpu <layers>" in a
# Modelfile or via the "options" field of an API request, tuned for your VRAM.

# Set number of parallel requests
export OLLAMA_NUM_PARALLEL=2

# How long a model stays loaded after its last request (e.g. 5m, 1h, or -1 for indefinitely)
export OLLAMA_KEEP_ALIVE="10m"
⚠️ Performance tip

Set OLLAMA_KEEP_ALIVE="-1" to keep the model permanently loaded. This eliminates the cold-start delay between prompts at the cost of persistent RAM usage.
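
The same behavior can be controlled per request through the API, which is useful if you only want one specific model pinned in memory. Sending a generate request with no prompt preloads the model:

# Preload gemma3 and keep it resident indefinitely (use 0 to unload immediately)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "keep_alive": -1
}'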

What models can you run?

Ollama's library hosts 100+ models. Here are the most popular for different use cases:

gemma3
by Google · 1B / 4B / 12B / 27B
Excellent all-rounder. Great for coding, reasoning, and chat. Very efficient at smaller sizes.
llama3.2
by Meta · 1B / 3B
Fast and lightweight. Best for constrained hardware. Ideal for quick tasks and embedding.
mistral
by Mistral AI · 7B
Strong at instruction-following and reasoning. Great balance of size and capability.
qwen2.5-coder
by Alibaba · 1.5B / 7B / 32B
Specialized for code. Excellent at completion, refactoring, and debugging tasks.
phi4
by Microsoft · 14B
Punches above its weight. Excellent reasoning per GB of model size.
deepseek-r1
by DeepSeek · 8B / 32B / 70B
Chain-of-thought reasoning model. Excellent for math, logic, and analytical tasks.
💡 Pro tip

Browse ollama.com/library to find and compare models. Model names with no tag default to :latest. Use specific tags like gemma3:4b-it-q4_K_M to control size and quantization level.

The honest trade-offs

Metric | Local (Ollama) | Cloud
Privacy | 100% private, zero data leaving the device | Prompts are processed on provider servers
Speed (7B model) | ~10 tok/s on CPU | ~80 tok/s
Cost (ongoing) | $0/month after setup | $20–$200+/mo
Model quality | Good (smaller models) | Best (GPT-4, Claude)
Offline use | Yes, works anywhere | Needs internet

Advantages

  • Complete privacy — prompts never leave your machine
  • No recurring costs after hardware investment
  • Works offline — planes, remote locations, no Wi-Fi
  • No rate limits or context window throttling
  • Fully customizable via Modelfiles and system prompts
  • OpenAI-compatible API — drop-in for existing tools
  • Run multiple models simultaneously
  • No censorship or content filtering from providers
  • Educational — understand how LLMs actually work

Shortcomings

  • Slower than cloud — CPU inference is notably slower than datacenter GPUs
  • Hardware ceiling — larger, smarter models need more RAM/VRAM
  • High RAM usage — a 7B model alone uses 5–8GB RAM
  • Model quality gap — local 7B models lag behind GPT-4 or Claude Opus
  • CPU heat and battery drain on laptops
  • Initial model downloads are large (2–30 GB per model)
  • No internet-connected tools (web search) without extra setup
  • Limited multimodal capability at smaller sizes

Ollama as an AI backend

Ollama's real superpower is acting as the engine for other tools. "Harness tools" are CLIs and editors that sit on top of a model API and give it agentic capabilities — like browsing files, writing code, and running commands.

OpenAI / Anthropic

Codex CLI

OpenAI's official CLI coding agent. Designed around OpenAI's hosted models, but it can be pointed at other OpenAI-compatible endpoints, including Ollama. It scans your codebase, understands context, and can write, edit, and run code.

Supports Ollama CPU Intensive
Anthropic

Claude Code

Anthropic's agentic coding tool. It primarily uses the Claude API; pointing it at local models generally requires a compatibility proxy that translates between API formats. Excellent for large codebases thanks to its extended thinking capability.

Experimental Local Best with Cloud
Open Source

Continue.dev

VS Code / JetBrains extension for AI coding. First-class Ollama support. Provides autocomplete, chat, and edit modes, all powered by your local models.

Native Ollama
Community

Open WebUI

A ChatGPT-like browser interface that connects directly to Ollama. Run it locally and get a full-featured chat UI with history, RAG, and model switching — all offline.

Native Ollama
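
Open WebUI is easiest to try via Docker. The exact command may change, so check its README, but a typical invocation looks like this (assumes Docker is installed and Ollama is already running on the host):

# Serve Open WebUI at http://localhost:3000, storing data in a named volume;
# --add-host lets the container reach the host's Ollama on port 11434
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main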

Using Codex CLI with Ollama + Gemma

Running an open-weight coding agent entirely on local hardware — zero cloud calls, full source-code privacy. Here's exactly what happens, step by step.

Step 1 Make sure Ollama is running and pull your model
$ ollama pull gemma3:4b
Step 2 Install Codex CLI and point it at Ollama's OpenAI-compatible endpoint
npm install -g @openai/codex
# Configure Ollama (http://localhost:11434/v1) as Codex's model provider and
# pick a model such as gemma3:4b (see the configuration sketch after these steps)
Step 3 Navigate to your project directory and start Codex
cd ~/Desktop/my-project
codex
Step 4 Codex scans your project and builds a context map
> Working... analyzing file tree, reading imports, mapping dependencies
Step 5 You're live! Ask it anything about your codebase
> "Find and fix the null pointer bug in auth.js"
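
The provider configuration mentioned in Step 2 lives in Codex CLI's config file. Key names may differ between releases, so treat this as a sketch and verify against the docs for your installed version:

# Sketch: register Ollama as a model provider for Codex CLI
# (writes ~/.codex/config.toml; back up any existing config first)
cat > ~/.codex/config.toml <<'EOF'
model = "gemma3:4b"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
EOF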
🔍 Real-World Observation

Running Codex CLI with gemma3:4b on a 16 GB RAM laptop uses roughly 5–8 GB of RAM for the model and pushes the CPU to 70–100% during inference. Responses arrive in 10–30 seconds per turn. It works, but it requires patience and benefits enormously from GPU acceleration or higher-end hardware. For production use, 32 GB of RAM or an NVIDIA GPU with 8 GB+ of VRAM is recommended.

Quick Reference — Codex CLI Commands

Command | Description
codex --model <name> | Start Codex CLI with a specific model (with Ollama configured as the provider)
scan the project | Ask Codex to analyze your codebase and build understanding
find and fix a bug in @filename | Ask Codex to diagnose and patch bugs in a specific file
write tests for @filename | Generate unit tests for a given module
/model | Switch to a different model mid-session
Esc | Interrupt a running inference

Quick Reference — Ollama Commands

Command | Description
ollama run <model> | Start an interactive chat session with a model
ollama pull <model> | Download a model from the Ollama library
ollama list | Show all downloaded models
ollama ps | Show currently loaded models and resource usage
ollama rm <model> | Delete a model from local storage
ollama create <name> -f Modelfile | Create a custom model from a Modelfile
ollama show <model> | Display model metadata, parameters, and Modelfile
ollama serve | Start the Ollama server manually (usually auto-started)

Go deeper

The local AI ecosystem moves fast. Here are the best places to keep learning:

🦙
Official Docs
Ollama Documentation

Complete reference for commands, API, Modelfile format, and GPU setup guides.

ollama.com/docs
🧪
Model Library
Ollama Model Hub

Browse and search 100+ models with sizes, benchmarks, and pull commands.

ollama.com/library
🐙
Open Source
Ollama on GitHub

Source code, issues, community integrations, and contribution guides.

github.com/ollama/ollama
🖥️
GUI Tool
Open WebUI

A beautiful ChatGPT-style interface that runs locally on top of Ollama.

github.com/open-webui/open-webui
💻
VS Code Extension
Continue.dev

Integrate Ollama with VS Code or JetBrains for AI autocomplete and chat.

continue.dev
📰
Community
r/LocalLLaMA

The most active community for local LLM enthusiasts. Tips, benchmarks, model comparisons.

reddit.com/r/LocalLLaMA
📚
Research
Hugging Face

The home of open model weights, datasets, and leaderboards for model comparison.

huggingface.co
Inference Engine
llama.cpp

The C++ engine that powers Ollama under the hood. For advanced users who want direct control.

github.com/ggerganov/llama.cpp
🤖
Harness Tool
Codex CLI

OpenAI's terminal coding agent. Works with Ollama as an open-weight backend.

github.com/openai/codex
Term | Meaning
LLM | Large Language Model: a neural network trained on text to predict and generate language (GPT, Gemma, Llama, etc.)
Inference | Running a trained model to generate outputs. "Local inference" means your CPU/GPU does this, not a remote server.
Quantization | Compressing model weights from 16- or 32-bit floats down to 4-bit integers, reducing RAM requirements roughly 4–8x with minimal quality loss.
GGUF | The file format Ollama uses to store quantized models. Designed for efficient CPU inference.
Context Window | How many tokens (roughly, word pieces) the model can "see" at once. Larger = more memory needed. Configured via num_ctx.
Modelfile | A configuration file for customizing a model's behavior, like a Dockerfile for AI. Defines the system prompt, parameters, etc.
Open-weight | Models whose weights (parameters) are publicly released so you can download and run them yourself; often loosely called "open source AI", though licenses vary.
Harness tool | A CLI or app that wraps a model API and gives it agentic capabilities (file access, code execution, tool calling).
tok/s | Tokens per second, a measure of inference speed. 10 tok/s is roughly 7 words per second. Cloud APIs do 50–100+ tok/s.